We study the property of the Fused Lasso Signal Approximator 
(FLSA) for estimating a blocky signal sequence with additive noise. 
We transform the FLSA to an ordinary Lasso problem. By studying 
the property of the design matrix in the transformed Lasso prob- 
lem, we find that the irrepresentable condition might not hold, in 
which case we show that the FLSA might not be able to recover the 
signal pattern. We then apply the newly developed preconditioning 
method - Puffer Transformation [Jia and Rohe, 2012] on the trans- 
formed Lasso problem. We call the new method the preconditioned 
fused Lasso and we give non-asymptotic results for this method. Re- 
sults show that when the signal jump strength (signal difference be- 
tween two neighboring groups) is big and the noise level is small, our 
preconditioned fused Lasso estimator gives the correct pattern with 
high probability. Theoretical results give insight on what controls the 
signal pattern recovery ability - it is the noise level instead of the 
length of the sequence. Simulations confirm our theorems and show 
significant improvement of the preconditioned fused Lasso estimator 
over the vanilla FLSA. 



1. Introduction. Assume we have a sequence of signals $(y_1, y_2, \ldots, y_n)$ that follows
the linear model

(1) $y_i = \mu_i^* + \epsilon_i, \quad i = 1, 2, \ldots, n,$

where $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ is the observed signal vector, $\mu^* = (\mu_1^*, \ldots, \mu_n^*)^T \in \mathbb{R}^n$
is the expected signal, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is white noise whose entries are assumed to be
i.i.d. normal with mean 0 and variance $\sigma^2$. The model is assumed
to be sparse in the sense that the signals come in blocks and only a few of the blocks
are nonzero. To be exact, there exists a partition $\{1, 2, \ldots, n\} = \bigcup_{j=1}^{J} [L_j, U_j]$ with
$L_1 = 1$, $U_J = n$, $U_j \ge L_j$, $L_{j+1} = U_j + 1$, and the following stepwise function holds:

$\mu_i^* = \nu_j \quad \text{for } L_j \le i \le U_j, \quad j = 1, \ldots, J,$

with $\nu_j$, $L_j$, $U_j$ fixed but unknown. We also assume that the vector $\nu = (\nu_1, \ldots, \nu_J)$ is
sparse, meaning that only a few of the $\nu_j$'s are nonzero. We point out that the Gaussian noise
assumption is not necessary, but we keep it to gain insight into the fused Lasso. The variance
$\sigma^2$ of $\epsilon_i$ measures the noise level and does not have to be a constant here. In many
real data problems, each observation $y_i$ is an average of multiple measurements, so
$\sigma^2$ decreases as the number of measurements increases.

This model featured by blockiness and sparseness has many applications. For example, 
in tumor studies, based on the Comparative Genomic Hybridization (CGH) data, it can be 
used to automatically detect the gains and losses in DNA copies by taking "signals" as the 
log-ratio between the number of DNA copies in the tumor cells and that in the reference 
cells [Tibshirani and Wang, 2008]. 

One way to estimate the unknown parameters is via the Fused Lasso Signal Approximator
(FLSA), defined as follows [Tibshirani et al., 2004, Friedman et al., 2007]:

(2) $\hat\mu(\lambda_1, \lambda_2) = \arg\min_{\mu} \frac{1}{2}\|Y - \mu\|_2^2 + \lambda_1\|\mu\|_1 + \lambda_2\|\mu\|_{TV},$

where $\|\mu\|_1 = \sum_{i=1}^{n} |\mu_i|$, $\|\mu\|_2^2 = \sum_{i=1}^{n} \mu_i^2$ and $\|\mu\|_{TV} = \sum_{i=1}^{n-1} |\mu_{i+1} - \mu_i|$. The $\ell_1$-norm
regularization controls the sparsity of the signal and the total variation seminorm ($\|\mu\|_{TV}$)
regularization controls the number of blocks (partitions or groups).
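
As a small illustration of Equation (2), the objective can be evaluated directly for any candidate $\mu$. The sketch below is not a solver; the function name `flsa_objective` and its argument names are illustrative assumptions, and it assumes the factor $\frac{1}{2}$ on the squared-error term.

```python
import numpy as np

def flsa_objective(y, mu, lam1, lam2):
    """Evaluate the FLSA objective of Equation (2) at a candidate mu.

    Hypothetical helper for illustration only; it does not minimize anything.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    fit = 0.5 * np.sum((y - mu) ** 2)        # (1/2) * ||Y - mu||_2^2
    l1 = np.sum(np.abs(mu))                  # ||mu||_1, controls sparsity
    tv = np.sum(np.abs(np.diff(mu)))         # ||mu||_TV, controls number of blocks
    return fit + lam1 * l1 + lam2 * tv
```

A constant candidate has zero total variation, so only the fit and the $\ell_1$ terms contribute in that case.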

Figure 1 gives one example of the signal sequence and the FLSA estimate on CGH data. 
More details and examples can be seen in Tibshirani and Wang [2008]. 

Fig 1. This figure is from Tibshirani and Wang [2008]. The fused Lasso is applied to some CGH data. The
data are shown in the left panel, and the solid line in the right panel represents the signals estimated by the
fused Lasso. The horizontal line is for y = 0.

One important question for the FLSA is how good the estimator defined in Equation
(2) is. In this paper we analyze whether the FLSA can recover the "stepwise pattern".
We also try to answer the following question: what do we do if the FLSA does not recover
the "stepwise pattern"? To measure how good an estimator is, we introduce the following
definition of Pattern Recovery.

Definition 1 (Pattern Recovery). An FLSA solution $\hat\mu(\lambda_{1n}, \lambda_{2n})$ recovers the signal
pattern if and only if there exist $\lambda_{1n}$ and $\lambda_{2n}$ such that

(3) $\mathrm{sign}(\hat\mu_{i+1}(\lambda_{1n}, \lambda_{2n}) - \hat\mu_i(\lambda_{1n}, \lambda_{2n})) = \mathrm{sign}(\mu_{i+1}^* - \mu_i^*), \quad i = 1, \ldots, n-1.$

We use $\hat\mu =_{js} \mu^*$ as shorthand for Equation (3) (js stands for jump sign). The
FLSA with the property of pattern recovery means that it can be used to identify both the
groups and jump directions (up or down) between groups.

The concept of pattern recovery is very similar to sign recovery of the Lasso. In fact, 
some simple calculations in Section 2 tell us that the pattern recovery property of the 
FLSA can be transformed to the sign recovery property of the Lasso estimator. 

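For observation pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, the Lasso is defined
as follows [Tibshirani, 1996]:

$\hat\beta(\lambda) = \arg\min_{\beta} \frac{1}{2}\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda\|\beta\|_1.$

Equivalently, in matrix form,

(4) $\hat\beta(\lambda) = \arg\min_{\beta} \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1,$

where $Y = (y_1, \ldots, y_n)^T$ and $X \in \mathbb{R}^{n \times p}$ with $x_i^T$ as its $i$th row. We use $X_j$ to denote the
$j$th column of $X$.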

Sign Recovery of the Lasso estimator is defined as follows. 

Definition 2 (Sign Recovery). Suppose that data $(X, Y)$ follow a linear model $Y = X\beta^* + \epsilon$,
where $Y = (y_1, \ldots, y_n)^T$, $X \in \mathbb{R}^{n \times p}$ with $x_i^T$ as its $i$th row, $\beta^* \in \mathbb{R}^{p \times 1}$ and
$\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^{n \times 1}$ with $E(\epsilon_i) = 0$. A Lasso estimator $\hat\beta(\lambda_n)$ has the sign recovery
property if and only if there exists $\lambda_n$ such that

(5) $\mathrm{sign}(\hat\beta_j(\lambda_n)) = \mathrm{sign}(\beta_j^*), \quad j = 1, \ldots, p.$

We will use $\hat\beta(\lambda_n) =_s \beta^*$ as shorthand for $\mathrm{sign}(\hat\beta_j(\lambda_n)) = \mathrm{sign}(\beta_j^*)$, $j = 1, \ldots, p$. The
Lasso estimator with the sign recovery property selects the correct set of
predictors. If $P(\hat\beta(\lambda_n) =_s \beta^*) \to 1$ as the sample size $n \to \infty$, we say that $\hat\beta(\lambda_n)$ is sign
consistent.

A rich theoretical literature has studied the consistency of the Lasso, highlighting several 
potential pitfalls [Knight and Fu, 2000, Fan and Li, 2001, Greenshtein and Ritov, 2004, 
Donoho et al., 2006, Meinshausen and Bühlmann, 2006, Tropp, 2006, Zhao and Yu, 2006, 
Zhang and Huang, 2008, Wainwright, 2009]. The sign consistency of the Lasso requires the 
irrepresentable condition, a stringent assumption on the design matrix [Zhao and Yu, 2006]. 
Now it is well understood that if the design matrix violates the irrepresentable condition, 
the Lasso will perform poorly and the estimation performance will not be improved by 
increasing the sample size. 

Our study of the pattern recovery of the FLSA begins with a transformation that changes 
the FLSA to a special Lasso problem. The data defined in the transformed Lasso prob- 
lem has correlated noise terms instead of independent ones. We prove that even for the 
linear model with correlated noise, the irrepresentable condition is still necessary for sign 
consistency. We then analyze the property of the design matrix in the transformed Lasso 
problem. We give a necessary and sufficient condition for the design matrix in the
transformed Lasso problem to satisfy the irrepresentable condition. We show that, for a special
class of models (with specially designed stepwise functions on $\mu^*$), the irrepresentable
condition holds. For other signal patterns, the irrepresentable condition does not hold and
thus the FLSA may fail to be consistent. A recent paper "Preconditioning to comply
with the irrepresentable condition" by Jia and Rohe [2012] shows that a Puffer Transfor- 
mation will improve the Lasso and make the Lasso estimator sign consistent under some 
mild conditions. We apply this technique, propose the preconditioned fused Lasso and show 
that it improves the FLSA and recovers the signal pattern with high probability. 

In Rinaldo [2009], the author also considers the consistency conditions for the FLSA. 
They showed that under some conditions, the FLSA can be consistent both in block re- 
construction and model selection. The author says in Rinaldo [2009] that the asymptotic 
results may have little guidance to the practical performance when n is finite. However, 
our method, as we will see, can not only provide mild conditions for the estimator to be 
consistent in block recovery but also give an explicit non-asymptotic lower bound on the 
probability that the true blocks are recovered. Numerical simulations also illustrate that 
in many cases our method turns out to be more effective in block recovery. 

The rest of the paper is organized as follows. In Section 2, we transform the FLSA 
problem into a Lasso problem and analyze the property of the design matrix in the trans- 
formed Lasso problem. Section 3 illustrates when the FLSA can recover the signal pattern 
and when it cannot. In Section 4, we propose a new algorithm called the preconditioned 
fused Lasso that improves the FLSA by the technique of Puffer Transformation (defined in 
Equation (20)). We show that for a wide range of designs of the stepwise function on $\mu^*$, this 
algorithm can recover the signal pattern with high probability. In Section 5, simulations 
are implemented to compare the performances between the preconditioned fused Lasso and 
the vanilla FLSA. Section 6 concludes the paper. Some proofs are given in the appendix. 

2. FLSA and the Lasso. We turn the FLSA problem into a Lasso problem by a change
of variables. Define the soft thresholding function $SH_\lambda(x)$ as

$SH_\lambda(x) = \begin{cases} x + \lambda, & x < -\lambda, \\ 0, & -\lambda \le x \le \lambda, \\ x - \lambda, & x > \lambda. \end{cases}$

Let $\hat\mu(\lambda_1, \lambda_2)$ be the fused Lasso estimator defined in Equation (2). We have the following
result.

Lemma 1. $\hat\mu(\lambda_1, \lambda_2) = SH_{\lambda_1}(\hat\mu(0, \lambda_2))$.

The proof of Lemma 1 can be found in Friedman et al. [2007]. By Lemma 1, to study
the properties of $\hat\mu(\lambda_1, \lambda_2)$ we can first set $\lambda_1 = 0$. Since pattern recovery
is our main concern, throughout the paper we only consider the case $\lambda_1 = 0$. When $\lambda_1 = 0$, we can solve
the FLSA by a change of variables. Let $\theta_1 = \mu_1$ and $\theta_i = \mu_i - \mu_{i-1}$, $i = 2, \ldots, n$. In matrix form,
we have $\mu = A\theta$, with

(6) $A = (a_{ij}) \in \mathbb{R}^{n \times n}, \quad a_{ij} = \begin{cases} 1, & i \ge j, \\ 0, & i < j. \end{cases}$

So by using $\theta$ instead of $\mu$, we obtain $\hat\mu(0, \lambda_2)$ equivalently via the following
$\hat\theta(\lambda_2)$.



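(7) $\hat\theta(\lambda_2) = \arg\min_{\theta} \frac{1}{2}\|Y - A\theta\|_2^2 + \lambda_2\|\boldsymbol{\theta}\|_1,$

where $\boldsymbol{\theta} = (\theta_2, \theta_3, \ldots, \theta_n)^T \in \mathbb{R}^{n-1}$. Once we obtain $\hat\theta(\lambda_2)$, we have $\hat\mu(0, \lambda_2) = A\hat\theta(\lambda_2)$.
Given the special form of the design matrix $A$, Expression (7) is a Lasso problem with
an intercept. In fact, Expression (7) can be rewritten as

(8) $\hat\theta(\lambda_2) = \arg\min_{\theta} \frac{1}{2}\|Y - \theta_1\mathbf{1} - X\boldsymbol{\theta}\|_2^2 + \lambda_2\|\boldsymbol{\theta}\|_1,$

where $\boldsymbol{\theta} = (\theta_2, \ldots, \theta_n)^T$ and $X = (x_{ij}) \in \mathbb{R}^{n \times (n-1)}$ with

(9) $x_{ij} = \begin{cases} 1, & i > j, \\ 0, & i \le j. \end{cases}$

Define the centered versions of $X \in \mathbb{R}^{n \times (n-1)}$ and $Y \in \mathbb{R}^n$ as follows:

(10) $\mathbf{X} = [X_1 - \bar{X}_1, \ldots, X_{n-1} - \bar{X}_{n-1}]$ and $\mathbf{Y} = Y - \bar{Y},$

with $\bar{u}$ being the average of the vector $u$. It is easy to see that Expression (8) is equivalent
to the following standard Lasso problem without an intercept:

(11) $\hat{\boldsymbol{\theta}}(\lambda_2) = \arg\min_{\boldsymbol{\theta}} \frac{1}{2}\|\mathbf{Y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 + \lambda_2\|\boldsymbol{\theta}\|_1,$ and $\hat\theta_1(\lambda_2) = \bar{Y} - \bar{X}\hat{\boldsymbol{\theta}}(\lambda_2),$

where $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_{n-1})$ is the row vector of column means of $X$.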

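The observation $Y = (y_1, \ldots, y_n)^T$ follows the model defined in Equation (1). Define
$\theta^* = A^{-1}\mu^*$ (equivalently, $\theta_1^* = \mu_1^*$ and $\theta_i^* = \mu_i^* - \mu_{i-1}^*$, $i = 2, \ldots, n$), where $A$ is defined in
Equation (6). Let $\boldsymbol{\theta}^* = (\theta_2^*, \theta_3^*, \ldots, \theta_n^*)^T \in \mathbb{R}^{n-1}$. Then $(X, Y)$ satisfy the following
linear model:

$Y = A\theta^* + \epsilon = \theta_1^*\mathbf{1} + X\boldsymbol{\theta}^* + \epsilon,$

where $X$ is defined in Equation (9). Consequently the centered version $(\mathbf{X}, \mathbf{Y})$ of $(X, Y)$, defined in
Equation (10), satisfies the following linear model:

(12) $\mathbf{Y} = \mathbf{X}\boldsymbol{\theta}^* + \tilde\epsilon,$

where $\tilde\epsilon = \epsilon - \bar\epsilon$ with $E(\tilde\epsilon) = 0$. Now we see that $\hat{\boldsymbol{\theta}}(\lambda_2)$ defined in (11) has the sign recovery
property if and only if $\hat{\boldsymbol{\theta}}(\lambda_2) =_s \boldsymbol{\theta}^*$. By the relationship between $\mu$ and $\theta$, $\hat{\boldsymbol{\theta}}(\lambda_2) =_s \boldsymbol{\theta}^*$ is
equivalent to $\hat\mu(0, \lambda_2) =_{js} \mu^*$. In other words, the pattern recovery property of an FLSA
can be viewed as sign recovery of a Lasso estimator.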
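
The change of variables can be checked numerically on a small example. The sketch below builds $A$ and $X$ as lower-triangular indicator matrices and verifies the noiseless versions of $\mu = A\theta$ and of the centered model; the helper names and the toy signal are illustrative assumptions.

```python
import numpy as np

def change_of_variables(n):
    """Return A (n x n) and X (n x (n-1)) for the change of variables."""
    A = np.tril(np.ones((n, n)))   # a_ij = 1 if i >= j, else 0
    X = A[:, 1:]                   # x_ij = 1 if i > j, else 0 (A without its first column)
    return A, X

def to_theta(mu):
    """theta_1 = mu_1 and theta_i = mu_i - mu_{i-1} for i >= 2."""
    mu = np.asarray(mu, dtype=float)
    return np.concatenate(([mu[0]], np.diff(mu)))

# Hypothetical blocky signal with two jumps.
mu_star = np.array([0.0, 0.0, 2.0, 2.0, 2.0, -1.0])
A, X = change_of_variables(mu_star.size)
theta_star = to_theta(mu_star)

# Centered design and response; the response is noiseless here (Y = mu_star).
X_centered = X - X.mean(axis=0)
Y_centered = mu_star - mu_star.mean()

print(np.allclose(A @ theta_star, mu_star))
print(np.allclose(X_centered @ theta_star[1:], Y_centered))
```

The nonzero entries of `theta_star[1:]` sit exactly at the jumps of `mu_star`, which is why sign recovery for the transformed problem corresponds to pattern recovery for the signal.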

Property 1. The pattern recovery of the FLSA $\hat\mu(0, \lambda_2)$ defined in Equation (2) is
equivalent to the sign consistency of the Lasso estimator $\hat{\boldsymbol{\theta}}(\lambda_2)$ defined in Equation
(11).

Note that this change of variables serves mainly for theoretical analysis rather than 
computational facilitation. Although there are many mature algorithms for the Lasso, 
transforming the FLSA to the Lasso is not recommended in practice because it makes the 
design matrix in (11) much denser, which is unfavorable to the efficiency of computation.
Instead, Friedman et al. [2007] develop a specialized algorithm for the FLSA based on
coordinate-wise descent. Hoefling [2010] generalizes the path algorithm and extends it 
to the general fused Lasso problem. However, in our consistency analysis, this transforma- 
tion works since we can use the well understood techniques on the Lasso to analyze the 
theoretical properties of the FLSA. 

We now turn to analyze the Lasso problem defined in Equation (11). 

3. The Transformed Lasso. It is now well understood that in a standard linear 
regression problem the Lasso is sign consistent when the design matrix satisfies some strin- 
gent conditions. One such condition is the irrepresentable condition defined as follows. 

Definition 3 (Irrepresentable Condition). The design matrix $X$ satisfies the Irrepresentable
Condition for $\beta^*$ with support $S = \{j : \beta_j^* \ne 0\}$ if, for some $\eta \in (0, 1]$,

(13) $\left\| X_{S^c}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta_S^*) \right\|_\infty \le 1 - \eta,$

where for a vector $x$, $\|x\|_\infty = \max_i |x_i|$, and for $T \subset \{1, \ldots, p\}$ with $|T| = t$, $X_T \in \mathbb{R}^{n \times t}$ is
the matrix which contains the columns of $X$ indexed by $T$.

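Let $\Lambda_{\min}(M)$ be the minimal eigenvalue of a matrix $M$ and

$C_{\min} = \Lambda_{\min}(X_S^T X_S) > 0.$

Define

$\Psi(X, \beta^*, \lambda) = \lambda \left[ \frac{\eta}{\sqrt{C_{\min}} \max_{j \in S^c} \|X_j\|_2} + \left\| (X_S^T X_S)^{-1} \mathrm{sign}(\beta_S^*) \right\|_\infty \right].$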

With the above notation, we have a general non-asymptotic result for the sign recovery of 
the Lasso when data (X, Y) follow a linear model. 

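Theorem 1. Suppose that data $(X, Y)$ follow a linear model $Y = X\beta^* + \epsilon$, where $Y =
(y_1, \ldots, y_n)^T \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$ with $x_i^T$ as its $i$th row, $\beta^* \in \mathbb{R}^{p \times 1}$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in
\mathbb{R}^{n \times 1}$ with $\epsilon \sim N(0, \Sigma_\epsilon)$. Assume that the irrepresentable condition (13) holds. If $\lambda$ satisfies

$M(\beta^*) > \Psi(X, \beta^*, \lambda),$

then with probability greater than

$1 - 2p \exp\left\{ -\frac{\lambda^2 \eta^2}{2 \Lambda_{\max}(\Sigma_\epsilon) \max_{j \in S^c} \|X_j\|_2^2} \right\},$

the Lasso has a unique solution $\hat\beta(\lambda)$ with $\hat\beta(\lambda) =_s \beta^*$.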

The proof of Theorem 1 is very similar to that of Lemma 3 in Jia and Rohe [2012] (p.
24). The only difference is that Jia and Rohe [2012] scale each column of $X$ so that
$\|X_j\|_2 \le 1$, whereas here we make no assumption on the $\ell_2$ norm of $X_j$. If
we further assume that $\|X_j\|_2 \le 1$ for each $j$, then we obtain exactly the same
result as in Jia and Rohe [2012]. We therefore omit the proof of Theorem 1.

The irrepresentable condition is a key condition for the Lasso's sign consistency. A lot 
of researchers noticed that the irrepresentable condition is a necessary condition for the 
Lasso's sign consistency [Zhao and Yu, 2006, Wainwright, 2009, Jia et al., 2010]. We also 
state this conclusion under a more general linear model with correlated noise terms. 

Theorem 2. Suppose that data $(X, Y)$ follow a linear model $Y = X\beta^* + \epsilon$, with Gaussian
noise $\epsilon \sim N(0, \Sigma_\epsilon)$. The irrepresentable condition (13) is necessary for the sign
consistency of the Lasso. In other words, if

(14) $\left\| X_{S^c}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta_S^*) \right\|_\infty \ge 1,$

we have

$P(\hat\beta(\lambda) =_s \beta^*) \le \frac{1}{2}.$

A proof of Theorem 2 can be seen in the appendix. Theorem 2 says that if the irrepresentable
condition does not hold, it is very likely that the Lasso does not recover the signs of
the coefficients.

With the above theorem, we now come back to the transformed Lasso problem defined
in Equation (11) and examine whether the irrepresentable condition holds in this case.
Recall that for the Lasso problem transformed from the FLSA, we have the design matrix

$\mathbf{X} = [X_1 - \bar{X}_1, \ldots, X_{n-1} - \bar{X}_{n-1}].$

Denote $S = \{j : \theta_j^* \ne 0\}$ as the index set of the relevant variables in the true model. Let
$j$ be the index of any of the irrelevant variables. Then (13) can be written as

$\left| \mathbf{X}_j^T \mathbf{X}_S (\mathbf{X}_S^T \mathbf{X}_S)^{-1} \mathrm{sign}(\theta_S^*) \right| < 1, \quad \forall j \notin S,$

which is equivalent to

$|\hat{b}_j^T \mathrm{sign}(\theta_S^*)| < 1, \quad \forall j \notin S,$

with $\hat{b}_j \in \mathbb{R}^{|S|}$ the OLS estimate of $b_j$ in the following linear regression equation:

(15) $\mathbf{X}_j = \mathbf{X}_S b_j + e.$

Since $\mathbf{X}$ is the centered version of $X$, it can be easily shown that $\hat{b}_j$ is also the OLS estimate
of $b_j$ in the following linear regression equation:

(16) $X_j = b_0\mathbf{1} + X_S b_j + e,$

where $b_0 \in \mathbb{R}$ is the intercept term.

A stronger version of the irrepresentable condition is as follows:

(17) $\left\| \mathbf{X}_j^T \mathbf{X}_S (\mathbf{X}_S^T \mathbf{X}_S)^{-1} \right\|_1 < 1, \quad \forall j \notin S.$

If (17) holds, then for any $\mu^*$ (equivalently, for any $\theta^*$) the irrepresentable condition always
holds. Otherwise, if (17) does not hold, then there exists some $\theta^*$ such that the irrepresentable
condition fails to hold. We have a necessary and sufficient condition on $\mu^*$ such
that the stronger version of the irrepresentable condition (17) holds.

Theorem 3. Assume $Y = (y_1, \ldots, y_n)$ satisfies model (1) and the collection of the indexes
of jump points is $S = \{j_1, j_2, \ldots, j_s\}$ with $j_k$ ($1 \le k \le s$) increasing. Formally, $S = \{j :
\mu_j^* \ne \mu_{j-1}^*, j = 2, \ldots, n\}$. Then the stronger version of the irrepresentable condition (17)
holds if and only if the jump points are consecutive. That is, $s = 1$ or

$\max_{1 \le k < s} (j_{k+1} - j_k) = 1.$

Proof. Note that the OLS estimate of the coefficients in the linear regression equation
(16) is

(18) $\begin{pmatrix} \hat{b}_0 \\ \hat{b}_j \end{pmatrix} = (Z_S^T Z_S)^{-1} Z_S^T X_j,$

where $Z_S = (\mathbf{1}_n, X_S)$. We know that $Z_S^T Z_S = (t_{k\ell}) \in \mathbb{R}^{(s+1) \times (s+1)}$ with

$t_{k\ell} = n - \max\{j_{k-1}, j_{\ell-1}\},$

where we assume $j_0 = 0$. According to a linear algebra result stated in Lemma 4 in the
appendix, the inverse of this matrix is a tridiagonal matrix:

$(Z_S^T Z_S)^{-1} = \begin{pmatrix} r_{11} & r_{12} & & & \\ r_{21} & r_{22} & r_{23} & & \\ & r_{32} & r_{33} & r_{34} & \\ & & \ddots & \ddots & \ddots \\ & & r_{s,s-1} & r_{s,s} & r_{s,s+1} \\ & & & r_{s+1,s} & r_{s+1,s+1} \end{pmatrix},$

where

$r_{k\ell} = \begin{cases} \frac{1}{j_1}, & k = \ell = 1, \\ -\frac{1}{j_{\ell-1} - j_{\ell-2}}, & k = \ell - 1, \\ -\frac{1}{j_\ell - j_{\ell-1}}, & k = \ell + 1, \\ \frac{j_\ell - j_{\ell-2}}{(j_{\ell-1} - j_{\ell-2})(j_\ell - j_{\ell-1})}, & 1 < k = \ell < s + 1, \\ \frac{n - j_{s-1}}{(j_s - j_{s-1})(n - j_s)}, & k = \ell = s + 1, \\ 0, & \text{otherwise.} \end{cases}$

Denote $v = (\hat{b}_0, \hat{b}_j^T)^T = (Z_S^T Z_S)^{-1} Z_S^T X_j$. There are three pattern types that we need
to consider.

(i) If there exists $1 \le k < s$ such that $j_{k+1} - j_k \ge 2$, then for any $j$ with $j_k < j < j_{k+1}$,

$Z_S^T X_j = (\underbrace{n - j, n - j, \ldots, n - j}_{k+1}, n - j_{k+1}, n - j_{k+2}, \ldots, n - j_s)^T.$

We have

$v = \left(0, \ldots, 0, \frac{j_{k+1} - j}{j_{k+1} - j_k}, -\frac{j_{k+1} - j}{j_{k+1} - j_k} + 1, 0, \ldots, 0\right)^T.$

Hence,

(19) $\|\hat{b}_j\|_1 = \left| \frac{j_{k+1} - j}{j_{k+1} - j_k} \right| + \left| -\frac{j_{k+1} - j}{j_{k+1} - j_k} + 1 \right| = 1,$ since $j_k < j < j_{k+1}$.

(ii) If $j < j_1$,

$Z_S^T X_j = (n - j, n - j_1, n - j_2, \ldots, n - j_s)^T.$

We have

$v = \left(1 - \frac{j}{j_1}, \frac{j}{j_1}, 0, \ldots, 0\right)^T.$

Hence, $\|\hat{b}_j\|_1 = \left| \frac{j}{j_1} \right| < 1$.

(iii) If $j > j_s$,

$Z_S^T X_j = (\underbrace{n - j, \ldots, n - j}_{s+1})^T.$

We have

$v = \left(0, \ldots, 0, \frac{n - j}{n - j_s}\right)^T.$

Hence, $\|\hat{b}_j\|_1 = \left| \frac{n - j}{n - j_s} \right| < 1$.

These three cases for the position of $j \in S^c$ show that as long as $j$ is not between two
jump points, $\|\hat{b}_j\|_1 < 1$. Otherwise $\|\hat{b}_j\|_1 = 1$. So

$s = 1$ or $\max_{1 \le k < s} (j_{k+1} - j_k) = 1$

is necessary and sufficient for all $\|\hat{b}_j\|_1 < 1$, $j \in S^c$.

□
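
The three cases in the proof can be checked numerically. The sketch below computes the $\ell_1$ norm of the OLS coefficients of each non-jump column on the jump columns, using the centered design; the function name, the choice $n = 20$ and the jump sets are illustrative assumptions, and columns are indexed $1, \ldots, n-1$ with column $j$ being the indicator of rows $i > j$.

```python
import numpy as np

def irrepresentable_l1_norms(n, jumps):
    """Map each non-jump column j (1-based) to the L1 norm of its OLS
    coefficients on the jump columns, computed with the centered design."""
    X = np.tril(np.ones((n, n)))[:, 1:]          # column j (1-based): rows i > j
    Xc = X - X.mean(axis=0)                      # centered design
    S = [j - 1 for j in sorted(jumps)]           # 0-based positions of jump columns
    Sc = [j for j in range(n - 1) if j not in S]
    XS = Xc[:, S]
    # Column m of B holds the OLS coefficients of the m-th non-jump column.
    B = np.linalg.solve(XS.T @ XS, XS.T @ Xc[:, Sc])
    norms = np.abs(B).sum(axis=0)
    return {j + 1: float(v) for j, v in zip(Sc, norms)}

# Two jump columns with a gap between them: columns 6..9 fall under case (i).
gapped = irrepresentable_l1_norms(n=20, jumps=[5, 10])
# Consecutive jump columns: no column lies strictly between two jumps.
consecutive = irrepresentable_l1_norms(n=20, jumps=[5, 6])

print(gapped[7], gapped[2], gapped[15])
print(max(consecutive.values()))
```

Under these assumptions the norm equals 1 for a column strictly between the two jumps, and equals $j/j_1$ or $(n - j)/(n - j_s)$ outside them, matching cases (i) to (iii).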



The above theorem shows that only a few special structures on $\mu^*$ make the stronger
version of the irrepresentable condition hold. From the proof, we can derive a necessary
and sufficient condition for the irrepresentable condition.

Theorem 4. Assume $Y = (y_1, \ldots, y_n)$ satisfies model (1) and the collection of the indexes
of jump points is $S = \{j_1, j_2, \ldots, j_s\}$ with $j_k$ ($1 \le k \le s$) increasing. Formally, $S = \{j :
\mu_j^* \ne \mu_{j-1}^*, j = 2, \ldots, n\}$. Then the irrepresentable condition (13) holds if and only if one
of the following two conditions holds.

(1) The jump points are consecutive. That is, $s = 1$ or

$\max_{1 \le k < s} (j_{k+1} - j_k) = 1.$

(2) If there exists one group of data points (with more than 1 point) between some two
jump points and these data points have the same expected signal strength, then the two
jumps are of different directions (up or down). Formally, let $j_k$ and $j_{k+1}$ be two jump
points with $\mu_{j_k}^* = \cdots = \mu_{j_{k+1}-1}^*$; then $(\mu_{j_k}^* - \mu_{j_k-1}^*)(\mu_{j_{k+1}}^* - \mu_{j_{k+1}-1}^*) < 0$.

Proof. From Theorem 3, if condition (1) in Theorem 4 holds, the stronger version of
the irrepresentable condition holds and thus the irrepresentable condition (13) holds. If
condition (1) does not hold, then there exist two jump points $j_k$ and $j_{k+1}$ such that
$j_{k+1} \ge j_k + 2$ and $\mu_{j_k}^* = \cdots = \mu_{j_{k+1}-1}^*$. From Equation (19) in the proof of Theorem 3, we
see that the irrepresentable condition (13) holds if and only if $\theta_{j_k}^*$ and $\theta_{j_{k+1}}^*$ have different
signs. By the definition of $\theta$, we see that $(\mu_{j_k}^* - \mu_{j_k-1}^*)(\mu_{j_{k+1}}^* - \mu_{j_{k+1}-1}^*) < 0$ is equivalent
to $\theta_{j_k}^*$ and $\theta_{j_{k+1}}^*$ having different signs. □

Theorem 4 says that only a few configurations of $\mu^*$ make the irrepresentable condition 
hold. In practice, a lot of signal patterns do not satisfy either of the two conditions listed 
in Theorem 4. For the Lasso problem, to comply with the irrepresentable condition, Jia 
and Rohe [2012] proposed a Puffer Transformation. We now introduce the Puffer Trans- 
formation and apply it to solve the fused Lasso problem, which we call the preconditioned 
fused Lasso. 

4. Preconditioned Fused Lasso. Jia and Rohe [2012] introduces the Puffer Trans- 
formation to the Lasso when the design matrix does not satisfy the irrepresentable con- 
dition. They showed that when n > p, even if the Lasso is not sign consistent, after the 
Puffer Transformation, the Lasso is sign consistent under some mild conditions. 

We assume that the design matrix $X \in \mathbb{R}^{n \times p}$ has rank $d = \min\{n, p\}$. By the singular
value decomposition, there exist matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{p \times d}$ with $U^T U = V^T V = I_d$
and a diagonal matrix $D \in \mathbb{R}^{d \times d}$ such that $X = UDV^T$. Define the Puffer Transformation
[Jia and Rohe, 2012],

(20) $F_{n \times n} = UD^{-1}U^T.$

The preconditioned design matrix $FX$ has the same singular vectors as $X$. However, all of
the nonzero singular values of $FX$ are set to unity: $FX = UV^T$. When $n \ge p$, the columns
of $FX$ are orthonormal. When $n \le p$, the rows of $FX$ are orthonormal. Jia and Rohe [2012]
give a non-asymptotic result for the Lasso on $(FX, FY)$, stated as follows.
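
The effect of Equation (20) on the singular values is easy to verify numerically. The sketch below forms $F$ from a thin SVD and checks that $FX$ has orthonormal columns when $n \ge p$; the function name and the random test matrix are illustrative assumptions, and the matrix is assumed to have full column rank.

```python
import numpy as np

def puffer_transform(X):
    """Return F = U D^{-1} U^T from the thin SVD X = U D V^T.

    Assumes X has full rank min(n, p), so every singular value is nonzero.
    """
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    return (U / d) @ U.T        # dividing column k of U by d[k] gives U D^{-1}

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))    # hypothetical design with n = 30 >= p = 5
F = puffer_transform(X)
Z = F @ X                       # preconditioned design, equal to U V^T

print(np.allclose(Z.T @ Z, np.eye(5)))
print(np.linalg.svd(Z, compute_uv=False))
```

Because `Z.T @ Z` is the identity, a Lasso problem on the preconditioned data separates coordinate by coordinate.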

Theorem 5 (Jia and Rohe [2012]). Suppose that data $(X, Y)$ follow a linear model $Y =
X\beta^* + \epsilon$, where $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$ with $x_i^T$ as its $i$th row, $\beta^* \in \mathbb{R}^{p \times 1}$ and
$\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^{n \times 1}$ with $\epsilon \sim N(0, \sigma^2 I_n)$. Define the singular value decomposition of $X$
as $X = UDV^T$. Suppose that $n \ge p$ and $X$ has rank $p$. We further assume that the minimal
eigenvalue $\Lambda_{\min}(\frac{1}{n}X^T X) \ge C_{\min} > 0$. Define the Puffer Transformation, $F = UD^{-1}U^T$.
Let $Z = FX$ and $a = FY$. Define

$\hat\beta(\lambda) = \arg\min_{b} \frac{1}{2}\|a - Zb\|_2^2 + \lambda\|b\|_1.$

If $\min_{j \in S} |\beta_j^*| \ge 2\lambda$, then with probability greater than

(21) $1 - 2p \exp\left\{ -\frac{n\lambda^2 C_{\min}}{2\sigma^2} \right\},$

$\hat\beta(\lambda) =_s \beta^*$.

The proof of Theorem 5 can be found in Jia and Rohe [2012]. From the proof we see
that the assumption $\epsilon \sim N(0, \sigma^2 I_n)$ can be relaxed to $\epsilon \sim N(0, \Sigma)$ with $\max_i \Sigma_{ii} \le \sigma^2$.
Comparing Theorem 5 to Theorem 1, we see that with the Puffer Transformation, the Lasso
no longer needs the irrepresentable condition.

The FLSA problem can be transformed to a standard Lasso problem. We have already 
shown that for most configurations of $\mu^*$, the design matrix $\mathbf{X}$ does not satisfy the
irrepresentable condition. Now we turn to the Puffer Transformation and obtain a concrete 
non-asymptotic result for the preconditioned fused Lasso. First we have the following result 
on the singular values of X . 

Lemma 2. $\mathbf{X} \in \mathbb{R}^{n \times (n-1)}$ is defined in Equation (10). Let $\sigma_j(\cdot)$ denote the $j$-th largest
singular value of a matrix. Then

$\sigma_1(\mathbf{X}) \ge \sigma_2(\mathbf{X}) \ge \cdots \ge \sigma_{n-1}(\mathbf{X}) \ge 0.5.$

A proof of Lemma 2 can be found in the appendix. With the lower bound on singular 
values of X and applying Theorem 5, we have the following result for our preconditioned 
fused Lasso. 
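
The lower bound in Lemma 2 can be sanity-checked numerically for a few values of $n$. This is only an illustration, not a substitute for the proof; the helper names and the chosen sizes are assumptions.

```python
import numpy as np

def centered_design(n):
    """Centered n x (n-1) design of Equation (10)."""
    X = np.tril(np.ones((n, n)))[:, 1:]   # x_ij = 1 if i > j, else 0
    return X - X.mean(axis=0)

def smallest_singular_value(n):
    """Smallest of the n-1 singular values of the centered design."""
    return float(np.linalg.svd(centered_design(n), compute_uv=False).min())

for n in (5, 50, 200):
    print(n, smallest_singular_value(n))
```

In these runs the smallest singular value stays above 0.5 and moves closer to 0.5 as $n$ grows.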

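Theorem 6. Assume $Y = (y_1, \ldots, y_n)$ satisfies model (1). $\mathbf{X}$ and $\mathbf{Y}$ are defined in
Equation (10). Let $\theta^* = A^{-1}\mu^*$ (equivalently, $\theta_1^* = \mu_1^*$ and $\theta_i^* = \mu_i^* - \mu_{i-1}^*$, $i = 2, \ldots, n$), where
$A$ is defined in Equation (6). Let $\boldsymbol{\theta}^* = (\theta_2^*, \theta_3^*, \ldots, \theta_n^*)^T \in \mathbb{R}^{n-1}$. Define the singular value
decomposition of $\mathbf{X}$ as $\mathbf{X} = UDV^T$. Denote the Puffer Transformation, $F = UD^{-1}U^T$. Let
$Z = F\mathbf{X}$ and $a = F\mathbf{Y}$. Define

(22) $\hat\beta(\lambda) = \arg\min_{b} \frac{1}{2}\|a - Zb\|_2^2 + \lambda\|b\|_1.$

If $\min_{j \ge 2, \theta_j^* \ne 0} |\theta_j^*| \ge 2\lambda$, then with probability greater than

$1 - 2n \exp\left\{ -\frac{\lambda^2}{8\sigma^2} \right\},$

$\hat\beta(\lambda) =_s \boldsymbol{\theta}^*$.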
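
Proof. By Equation (12),

$\mathbf{Y} = \mathbf{X}\boldsymbol{\theta}^* + \tilde\epsilon,$

where $\tilde\epsilon = \epsilon - \bar\epsilon$ with $E(\tilde\epsilon) = 0$, and

$\mathrm{var}(\tilde\epsilon_i) = \mathrm{var}(\epsilon_i - \bar\epsilon) = \frac{n-1}{n}\sigma^2 < \sigma^2.$

According to the comments below Theorem 5, we can apply Theorem 5 to obtain a lower
bound on $P(\hat\beta(\lambda) =_s \boldsymbol{\theta}^*)$. Let $s_1 \le s_2 \le \cdots \le s_{n-1}$ be the singular values of $\mathbf{X}$. From Lemma
2, $s_1 \ge 0.5$. So $\Lambda_{\min}(\frac{1}{n}\mathbf{X}^T\mathbf{X}) = \frac{s_1^2}{n} \ge \frac{1}{4n}$. Putting $C_{\min} = \frac{1}{4n}$ in expression (21) and noting that $\mathbf{X}$
has $n - 1$ columns, we have

$P(\hat\beta(\lambda) =_s \boldsymbol{\theta}^*) > 1 - 2(n - 1) \exp\left\{ -\frac{\lambda^2}{8\sigma^2} \right\} > 1 - 2n \exp\left\{ -\frac{\lambda^2}{8\sigma^2} \right\}.$

□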



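By the relationship between $\theta^*$ and $\mu^*$, if $\hat\beta(\lambda)$, the estimate of $\boldsymbol{\theta}^*$, has the sign recovery
property, then the estimate of $\mu^*$ defined as follows has the property of pattern recovery:

(23) $\hat\mu = A\hat\theta$

with

$\hat\theta = [\hat\theta_1, \hat\beta(\lambda)]$ and $\hat\theta_1 = \bar{Y} - \bar{X}\hat\beta(\lambda)$.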

Theorem 6 shows that the ability of pattern recovery depends on the signal jump strength
($\min_{j \ge 2, \theta_j^* \ne 0} |\theta_j^*|$) and the noise level $\sigma^2$. To get a pattern-consistent estimate, we need
$\sigma^2$ small enough and $\min_{j \ge 2, \theta_j^* \ne 0} |\theta_j^*|$ big enough. To think about the small $\sigma^2$ issue, we
can treat each $y_i$ as an average of multiple Gaussian measurements. If the number of
measurements is $m$, then $\sigma^2 = \sigma_0^2/m$ with some constant $\sigma_0$. If $m \gg \log(n)$, we can find a
very small $\lambda$ to make the estimator defined in Equation (23) have the pattern recovery
property. One choice of A is such that A 2 = ~^=~^~- F° r this choice of A, the probability 

of ft* =j s fi* is greater than 1 — 2 exp f— — 1] log(n + 1)^ , which goes to 1 as m goes 
to 00. 

In the next section, we use simulations to illustrate that for general signal patterns, the 
FLSA does not have the pattern recovery property while the preconditioned fused Lasso 
has, which enhances our findings. 

5. Simulations. We use simulation examples to confirm our theorems. We first set
the model to be

(24) $y_i = \mu_i^* + \epsilon_i,$

where

$\mu_i^* = \begin{cases} 0, & 1 \le i \le 100, \\ -2, & 101 \le i \le 110, \\ -0.1, & 111 \le i \le 210, \\ 2, & 211 \le i \le 220, \\ 0.1, & 221 \le i \le 320, \\ -2, & 321 \le i \le 330, \\ 0, & 331 \le i \le 430, \end{cases}$

and the errors are i.i.d. Gaussian variables with mean 0 and standard deviation $\sigma = 0.25$.
This example is similar to the one in Rinaldo [2009] except that the noise here is larger
($\sigma = 0.2$ in Rinaldo [2009]). Figure 2 shows one sequence of sample data (points) along
with the true expected signal (lines).
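
For readers who want to reproduce a dataset of this form, the simulation model can be sketched as follows. The block lengths and levels follow Equation (24); the function names, the seed and the use of NumPy's random generator are assumptions of this sketch.

```python
import numpy as np

# (block length, signal level) pairs following Equation (24).
BLOCKS = [(100, 0.0), (10, -2.0), (100, -0.1), (10, 2.0),
          (100, 0.1), (10, -2.0), (100, 0.0)]

def expected_signal():
    """Piecewise-constant expected signal of length 430."""
    return np.concatenate([np.full(length, level) for length, level in BLOCKS])

def simulate(sigma=0.25, seed=0):
    """Draw one dataset: expected signal plus i.i.d. N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    mu_star = expected_signal()
    y = mu_star + rng.normal(0.0, sigma, size=mu_star.size)
    return mu_star, y

mu_star, y = simulate()
print(mu_star.size, np.count_nonzero(np.diff(mu_star)))
```

The expected signal has seven blocks and therefore six jump points, which is the pattern the estimators are asked to recover.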



Fig 2. Sample data (points) and the expected signals (lines). The horizontal axis is the index.



From Figure 2 we see that the data points are grouped into seven clusters and feature
three spikes. The points can be well separated due to small noise. We will use this
typical example to compare the performance of the two methods, the FLSA and the
preconditioned fused Lasso, in recovering the signal patterns. There are many criteria that
can be used for comparison. In the context of pattern recovery, it is natural to define a
loss function, which we call the pattern loss ($\ell_{PA}$), of the recovered sequence of signals
$\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n) \in \mathbb{R}^n$ as follows:

$\ell_{PA}(\hat\mu) = |\{i : \mathrm{sign}(\hat\mu_{i+1} - \hat\mu_i) \ne \mathrm{sign}(\mu_{i+1}^* - \mu_i^*), i = 1, \ldots, n - 1\}|,$

where $\mu^* = (\mu_1^*, \ldots, \mu_n^*)$ is the expected signal and $|\cdot|$ is the cardinality of a set. Note
that the pattern loss achieves 0 if and only if the pattern of the signals is recovered exactly.
We compare the solutions under the two methods (FLSA and preconditioned fused Lasso).
For each method, the solution chosen is the one that minimizes the pattern loss on the
solution path.
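
The pattern loss is straightforward to compute from successive differences. A minimal sketch follows; the function name is an assumption, and the comparison treats only exactly equal neighboring values as "no jump", so an estimate with tiny numerical differences would need rounding first.

```python
import numpy as np

def pattern_loss(mu_hat, mu_star):
    """Count positions i where the jump sign of mu_hat differs from mu_star."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    mu_star = np.asarray(mu_star, dtype=float)
    jump_sign_hat = np.sign(np.diff(mu_hat))     # sign(mu_hat[i+1] - mu_hat[i])
    jump_sign_star = np.sign(np.diff(mu_star))   # sign(mu_star[i+1] - mu_star[i])
    return int(np.sum(jump_sign_hat != jump_sign_star))
```

Note that the loss ignores jump sizes: an estimate with the right jump locations and directions but shrunken levels still has zero pattern loss.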

We first apply the FLSA to estimate $\mu_i^*$, $i = 1, 2, \ldots, 430$. When calculating the FLSA
solution, we use the path algorithm proposed by Hoefling [2010], which efficiently computes
the whole solution path of the FLSA. An R package ("flsa") for this algorithm is available at
http://cran.r-project.org/web/packages/flsa/index.html. In fact, the whole FLSA solution
path is piecewise linear in $\lambda$; "flsa" only stores the sequence of $\lambda$'s at which the direction of the
linear function changes. Note that the pattern loss does not change with $\lambda$ on every linear
piece of the solution path. By comparing the signal pattern of all the estimated signals
on the solution path with the true signals $\mu^*$, we see that no solution
recovers the original signal pattern. That is, all the FLSA solutions have a positive pattern
loss. We present in Figure 3 (left panel) the solution that minimizes this loss. We see that
this estimate is just the trivial estimate that averages all the signals, which obviously does
not give satisfactory recovery of signal patterns.

For each $\lambda$ in the sequence, we also calculate the usual $\ell_2$ distance between the
estimated signals $\hat\mu$ and the true ones. The estimate with the smallest $\ell_2$ distance is
reported in Figure 3 (right panel). This estimate does not recover the
original signal pattern either.

Fig 3. FLSA solutions. Left panel: the FLSA solution (in black lines) with tuning parameter selected by
minimizing the pattern loss of $\hat\mu$; right panel: the FLSA solution (in black lines) with tuning parameter
selected by minimizing the $\ell_2$ error between $\hat\mu$ and $\mu^*$. The red lines are the expected signal sequence.

To compare, we calculate the solution of the Lasso defined in Equation (22). After the
SVD and the Puffer Transformation, this becomes much easier: we only need to do a
soft-thresholding with the given $\lambda$. This is because

$Z^T Z = \mathbf{X}^T F^T F \mathbf{X} = (VDU^T)(UD^{-1}U^T)(UD^{-1}U^T)(UDV^T) = I_{n-1}$

and the property of the Lasso allows us to solve it directly by soft-thresholding:

$\hat\theta(\lambda) = SH_\lambda(Z^T a).$

Obviously, $\hat\theta(\lambda)$ is also a piecewise linear function in $\lambda$ and the break points are $\lambda_i =
|Z^T a|_{(i)}$, $i = 1, 2, \ldots, n - 1$, where $x_{(i)}$ denotes the $i$th largest value in vector $x$ and $n - 1$ is the
dimension of the vector $Z^T a$. On the solution path, for each $\lambda_i$ we have an estimate $\hat\mu$ for
$\mu^*$. By examining the pattern loss of the solutions on the path, we find that there is one
solution (in fact any solution on the linear piece between the one chosen and the one at
the previous breakpoint) that has 0 loss, which means it recovers the signal pattern exactly. We
report this solution in Figure 4.
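
The soft-thresholding recipe above can be sketched end to end for a single value of $\lambda$. This is a minimal illustration under the definitions in Equations (6), (9), (10), (20), (22) and (23), not the code used for the reported experiments; the function names are assumptions, and the dense SVD is only practical for moderate $n$.

```python
import numpy as np

def soft_threshold(x, lam):
    """Componentwise soft thresholding SH_lam(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def preconditioned_fused_lasso(y, lam):
    """Return the estimate of Equation (23) for one tuning parameter lam."""
    y = np.asarray(y, dtype=float)
    n = y.size
    X = np.tril(np.ones((n, n)))[:, 1:]         # Equation (9)
    col_means = X.mean(axis=0)
    Xc, yc = X - col_means, y - y.mean()        # Equation (10)
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    F = (U / d) @ U.T                           # Puffer Transformation, Equation (20)
    Z, a = F @ Xc, F @ yc
    beta = soft_threshold(Z.T @ a, lam)         # solves (22) because Z^T Z = I
    theta1 = y.mean() - col_means @ beta        # intercept term theta_1
    theta = np.concatenate(([theta1], beta))
    return np.cumsum(theta)                     # A @ theta is a cumulative sum

# Hypothetical noiseless example: the jump signs are recovered exactly.
mu_star = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0, -1.0, -1.0])
mu_hat = preconditioned_fused_lasso(mu_star, lam=0.5)
print(np.sign(np.diff(mu_hat)))
```

Scanning $\lambda$ over the break points and evaluating the pattern loss at each one gives the path search described in the text; with $\lambda = 0$ the estimate reproduces the data.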

Note that the reported estimate is heavily biased relative to the expected value. There is a tradeoff
between unbiasedness and the quality of pattern recovery. One possible remedy for the
bias is a two-stage estimator: in the first stage the signal pattern is recovered,
and in the second stage an unbiased estimate is obtained.

The above example uses only one data set to compare the performance of the FLSA 
and the preconditioned fused Lasso. We now randomly draw 1000 datasets and compare the 
approximate probability (denoted as P) that there exists a $\lambda$ such that the pattern of the 
signals can be completely recovered. The results are as follows. 

• FLSA: P ≈ 0. 

• Preconditioned Fused Lasso: P ≈ 0.926. 
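
A Monte Carlo estimate of this kind can be sketched as follows for the preconditioned fused Lasso. This is not the paper's experiment: the blocky signal `mu_star`, the number of replications and the helper names are hypothetical, and the model of Equation (24) is not reproduced. The sketch assumes the centered design with its zero column dropped, and uses the fact that, for a soft-thresholded vector `w`, a recovering $\lambda$ exists exactly when the entries on the true support have the right signs and exceed in magnitude all entries off the support.

```python
import numpy as np

def centered_design(n):
    """Centered lower-triangular design with the all-zero first column dropped."""
    X = np.tril(np.ones((n, n)))
    Xc = X - X.mean(axis=0)
    return Xc[:, 1:]

def exists_recovering_lambda(w, theta_star):
    """True if some lambda makes sign(soft-threshold(w)) equal sign(theta_star)."""
    S = theta_star != 0
    if not np.array_equal(np.sign(w[S]), np.sign(theta_star[S])):
        return False
    if not S.any():
        return True  # any lambda at or above max|w| works
    off = np.abs(w[~S]).max() if (~S).any() else 0.0
    return bool(np.abs(w[S]).min() > off)

def recovery_probability(mu_star, sigma, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(mu_star)
    U, d, Vt = np.linalg.svd(centered_design(n), full_matrices=False)
    theta_star = np.diff(mu_star)          # jumps between neighbouring signals
    hits = 0
    for _ in range(reps):
        y = mu_star + sigma * rng.standard_normal(n)
        yc = y - y.mean()
        # Z^T (F yc) with F = U D^{-1} U^T and Z = U V^T simplifies to V D^{-1} U^T yc
        w = Vt.T @ ((U.T @ yc) / d)
        hits += exists_recovering_lambda(w, theta_star)
    return hits / reps

mu_star = np.repeat([0.0, 1.0, -1.0, 0.0], 10)   # hypothetical blocky signal
p_low = recovery_probability(mu_star, sigma=0.01)
p_high = recovery_probability(mu_star, sigma=5.0)
```

Calling `recovery_probability` over a grid of `sigma` values gives a curve of the same kind as the one discussed for varying noise levels.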

This example again illustrates the strength of our algorithm in pattern recovery of blocky 
signals. Nevertheless, as one would expect, it loses power like other recovery algorithms when 
the noise level becomes stronger and makes it difficult to tell the boundaries between the 
blocks. Our theorem also reflects this relationship between recovery probability and noise 
level. In the next example, we vary $\sigma$ from 0.1 to 0.4 and compute the probability of 
pattern recovery. For each $\sigma$, we randomly draw 1000 datasets following the model described 
in Equation (24) and estimate the probability by the proportion of datasets for which there exists 
a $\lambda$ such that the pattern of the signals can be completely recovered. The estimated 
probabilities are reported in Figure 5, from which we see that the probability of pattern recovery 
under small noise is extremely high, but this no longer holds when the signals are corrupted by 
stronger noise, which makes the boundaries between groups vague and hard to distinguish. 

Fig 4. The preconditioned fused Lasso estimator (in black lines). Tuning parameter is selected by finding 
a solution which exactly recovers the signal pattern. The red line is the expected signal sequence. 

6. Conclusions and Discussions. In this paper we provided a better understanding of 
the FLSA. The FLSA can be transformed 
to a standard Lasso problem. The sign recovery of the transformed Lasso problem is 
equivalent to the pattern recovery of the FLSA problem. Theoretical analysis showed that the 
transformed Lasso problem is not sign consistent in most situations, so the FLSA might 
also suffer from this consistency problem when it is used to recover signal patterns. To overcome 
this problem, we introduced the preconditioned fused Lasso. We gave non-asymptotic 
results on the preconditioned fused Lasso. The results imply that when the noise is weak, 
the preconditioned fused Lasso can recover the signal pattern with high probability. We 
also found that the preconditioned fused Lasso is sensitive to the noise level. 
Simulation studies confirmed our findings. 

Fig 5. The estimated probability of pattern recovery under different noise levels for the preconditioned fused 
Lasso. Each point is estimated with 1000 randomly generated datasets. 

One may argue that we only considered pattern recovery using the fusion regularization 
parameter $\lambda_2$ and that the sparsity regularization parameter $\lambda_1$ can be used to adjust 
to the right pattern. However, remember that the main purpose of introducing $\lambda_1$ is for the 
sparsity of the model. It is not statistically reasonable to use this regularization parameter 
only to recover the blocks. 

Moreover, when both sparsity and block recovery are required, this is impossible in 
most cases. Using the example above, we claim that the pattern and sparsity cannot be 
recovered at the same time by the FLSA. We know from Friedman et al. [2007] that as long 
as two parameters are fused together for some $\lambda_2^{(0)}$, they will be fused for all $\lambda_2 > \lambda_2^{(0)}$. This 
implies conversely that if two $\hat{\mu}_i$'s are separated for some large $\lambda_2$, they will not be fused for 
any smaller $\lambda_2$. Let us focus on the first split as $\lambda_2$ decreases from some large value at which 
all the estimated parameters are fused together. We found that it happens at Point 210, which 
was not a jump point in the original signal sequence $\mu^*$. Then Point 209 ($\hat{\mu}_{209} \approx -0.063$) 
and Point 210 ($\hat{\mu}_{210} \approx -0.052$) can never be fused together as $\lambda_2$ goes down. The only 
way to bring them together is to drag both of them to zero in the soft-thresholding step. 
But they are nonzero signals ($\mu^*_{209} = \mu^*_{210} = -0.1$), and sparsity recovery would clearly be 
violated by doing so. 

We claim that good pattern recovery facilitates the steps that follow it. The preconditioned 
fused Lasso is reliable for pattern recovery, and so it can be incorporated into other 
procedures, such as the recovery of sparsity. 

Appendix 

We prove some of our theorems in the appendix. 

APPENDIX A: PROOF OF THEOREM 2 

We first give a well-known result that ensures the Lasso exactly recovers the sparse 
pattern of $\beta^*$, that is, $\hat{\beta}(\lambda) =_s \beta^*$. The following lemma gives necessary and sufficient 
conditions for $\mathrm{sign}(\hat{\beta}(\lambda)) = \mathrm{sign}(\beta^*)$, which follow from the KKT conditions. The proof 
of this lemma can be found in Wainwright [2009]. 

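Lemma 3. For the linear model $Y = X\beta^* + \epsilon$, assume that the matrix $X_S^T X_S$ is invertible. 
Then for any given $\lambda > 0$ and any noise term $\epsilon \in \mathbb{R}^n$, there exists a Lasso estimate 
$\hat{\beta}(\lambda)$ described in Equation (4) which satisfies $\hat{\beta}(\lambda) =_s \beta^*$ if and only if the following two 
conditions hold: 

(25) $\left| X_{S^c}^T X_S (X_S^T X_S)^{-1} \left[ X_S^T \epsilon - \lambda\, \mathrm{sign}(\beta^*_S) \right] - X_{S^c}^T \epsilon \right| \le \lambda,$ 

(26) $\mathrm{sign}\left( \beta^*_S + (X_S^T X_S)^{-1} \left[ X_S^T \epsilon - \lambda\, \mathrm{sign}(\beta^*_S) \right] \right) = \mathrm{sign}(\beta^*_S),$ 

where the vector inequality and equality are taken elementwise. Moreover, if the inequality 
(25) holds strictly, then 

$\hat{\beta} = (\hat{\beta}^{(1)}, 0)$ 

is the unique optimal solution to the Lasso problem in Equation (4), where 

(27) $\hat{\beta}^{(1)} = \beta^*_S + (X_S^T X_S)^{-1} \left[ X_S^T \epsilon - \lambda\, \mathrm{sign}(\beta^*_S) \right].$ 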
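
Conditions (25) and (26) are straightforward to evaluate for a given design, coefficient vector, noise draw and $\lambda$. The following is a minimal sketch, assuming the unscaled form of the conditions written above; the function name `lasso_sign_conditions` and the toy orthonormal design are hypothetical and chosen only so the numbers are easy to follow.

```python
import numpy as np

def lasso_sign_conditions(X, beta_star, eps, lam):
    """Evaluate conditions (25) and (26) of Lemma 3; returns (cond25, cond26)."""
    S = beta_star != 0
    XS, XSc = X[:, S], X[:, ~S]
    b = np.sign(beta_star[S])
    # shift = (X_S^T X_S)^{-1} [X_S^T eps - lam * sign(beta*_S)]
    shift = np.linalg.solve(XS.T @ XS, XS.T @ eps - lam * b)
    cond25 = bool(np.all(np.abs(XSc.T @ XS @ shift - XSc.T @ eps) <= lam))
    cond26 = bool(np.all(np.sign(beta_star[S] + shift) == b))
    return cond25, cond26

X = np.eye(3)                            # toy orthonormal design
beta_star = np.array([2.0, 0.0, -1.5])
eps = np.array([0.1, 0.3, -0.2])
ok_small = lasso_sign_conditions(X, beta_star, eps, lam=0.5)
ok_large = lasso_sign_conditions(X, beta_star, eps, lam=3.0)
```

In this toy case a moderate $\lambda$ satisfies both conditions, while a $\lambda$ larger than the signal magnitudes breaks (26) because the active coefficients are shrunk past zero.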

Remarks. As in Wainwright [2009], we state an equivalent condition for (25). Define 

$\vec{b} = \mathrm{sign}(\beta^*_S),$ 

and define, for each $j \in S^c$, 

$V_j = X_j^T \left\{ X_S (X_S^T X_S)^{-1} \lambda \vec{b} - \left[ X_S (X_S^T X_S)^{-1} X_S^T - I \right] \epsilon \right\}.$ 

By rearranging terms, it is easy to see that (25) holds if and only if 

(28) $\mathcal{M}(V) := \left\{ \max_{j \in S^c} |V_j| \le \lambda \right\}$ 

holds. 

With Lemma 3 and the above remarks, we now prove Theorem 2. Without loss of 
generality, assume that for some $j \in S^c$ and $\zeta \ge 0$, 

$X_j^T X_S (X_S^T X_S)^{-1} \vec{b} = 1 + \zeta.$ 

Then 

$V_j = \lambda(1 + \zeta) + \tilde{V}_j,$ 

where $\tilde{V}_j = -X_j^T \left[ X_S (X_S^T X_S)^{-1} X_S^T - I \right] \epsilon$ is a Gaussian random variable with mean 0, 
so $P(\tilde{V}_j > 0) = \frac{1}{2}$. Therefore, 

$P(V_j > \lambda) \ge \frac{1}{2},$ 

and the equality holds when $\zeta = 0$. This implies that for any $\lambda$, Condition (25) (a necessary 
condition) is violated with probability at least 1/2. 
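
The probability bound in the last step can be spelled out as follows. This only restates the argument above, writing $\zeta \ge 0$ for the excess over 1 and $\tilde{V}_j$ for the zero-mean Gaussian part of $V_j$, and using $\lambda \zeta \ge 0$ together with the symmetry of $\tilde{V}_j$ about 0.

```latex
\begin{aligned}
P(V_j > \lambda)
  &= P\big(\lambda(1+\zeta) + \tilde{V}_j > \lambda\big) \\
  &= P\big(\tilde{V}_j > -\lambda\zeta\big) \\
  &\ge P\big(\tilde{V}_j > 0\big) = \tfrac{1}{2}.
\end{aligned}
```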

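In the proof of Theorem 3, we need the following algebraic result. 

Lemma 4. For $k \ge 3$, let $a_1, \ldots, a_k \in \mathbb{R}$ be not equal to each other, and let $A = (a_{ij})_{k \times k}$ 
with $a_{ij} = a_\ell$, where $\ell = \max\{i, j\}$. That is, 

$$A = \begin{pmatrix} a_1 & a_2 & a_3 & \cdots & a_k \\ a_2 & a_2 & a_3 & \cdots & a_k \\ a_3 & a_3 & a_3 & \cdots & a_k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_k & a_k & a_k & \cdots & a_k \end{pmatrix}.$$ 

Then the inverse of $A$ is the tridiagonal matrix 

$$A^{-1} = \begin{pmatrix} r_{11} & r_{12} & & & \\ r_{21} & r_{22} & r_{23} & & \\ & r_{32} & r_{33} & r_{34} & \\ & & \ddots & \ddots & \ddots \\ & & r_{k-1,k-2} & r_{k-1,k-1} & r_{k-1,k} \\ & & & r_{k,k-1} & r_{kk} \end{pmatrix},$$ 

where 

$$r_{ij} = \begin{cases} \dfrac{1}{a_1 - a_2} & i = j = 1 \\[1ex] -\dfrac{1}{a_{j-1} - a_j} & i = j - 1 \\[1ex] -\dfrac{1}{a_j - a_{j+1}} & i = j + 1 \\[1ex] \dfrac{a_{j-1} - a_{j+1}}{(a_{j-1} - a_j)(a_j - a_{j+1})} & 1 < i = j < k \\[1ex] \dfrac{a_{k-1}}{(a_{k-1} - a_k)\, a_k} & i = j = k \\[1ex] 0 & \text{otherwise.} \end{cases}$$ 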
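Proof. This lemma can be directly verified via the following equations: 

$\sum_i a_{ji} r_{ij} = 1$ and $\sum_i a_{\ell i} r_{ij} = 0$, $\ell \ne j$. 

We first verify $\sum_i a_{ji} r_{ij} = 1$ for all $j$. 

When $j = 1$, 

$$\sum_i a_{1i} r_{i1} = a_1 r_{11} + a_2 r_{21} = \frac{a_1}{a_1 - a_2} - \frac{a_2}{a_1 - a_2} = 1.$$ 

When $1 < j < k$, 

$$\begin{aligned} \sum_i a_{ji} r_{ij} &= a_{j,j-1} r_{j-1,j} + a_{j,j} r_{j,j} + a_{j,j+1} r_{j+1,j} \\ &= a_j \cdot \frac{-1}{a_{j-1} - a_j} + a_j \cdot \frac{a_{j-1} - a_{j+1}}{(a_{j-1} - a_j)(a_j - a_{j+1})} + a_{j+1} \cdot \frac{-1}{a_j - a_{j+1}} \\ &= 1. \end{aligned}$$ 

When $j = k$, 

$$\sum_i a_{ki} r_{ik} = a_{k,k-1} r_{k-1,k} + a_{k,k} r_{k,k} = a_k \cdot \frac{-1}{a_{k-1} - a_k} + a_k \cdot \frac{a_{k-1}}{(a_{k-1} - a_k)\, a_k} = 1.$$ 

We next verify $\sum_i a_{\ell i} r_{ij} = 0$ for any $\ell \ne j$. We only verify the general case in which 
there are three nonzero elements in the $j$-th column of $A^{-1}$. The other verifications are the same. 
In this case $\sum_i a_{\ell i} r_{ij} = a_{\ell,j-1} r_{j-1,j} + a_{\ell,j} r_{j,j} + a_{\ell,j+1} r_{j+1,j}$. Since $\ell \ne j$, there are only two situations 
we need to consider: (1) $\ell \le j - 1$ and (2) $\ell \ge j + 1$. 

When $\ell \le j - 1$, 

$$\begin{aligned} \sum_i a_{\ell i} r_{ij} &= a_{\ell,j-1} r_{j-1,j} + a_{\ell,j} r_{j,j} + a_{\ell,j+1} r_{j+1,j} \\ &= a_{j-1} r_{j-1,j} + a_j r_{j,j} + a_{j+1} r_{j+1,j} \\ &= 0. \end{aligned}$$ 

When $\ell \ge j + 1$, 

$$\sum_i a_{\ell i} r_{ij} = a_\ell \left( \frac{-1}{a_{j-1} - a_j} + \frac{a_{j-1} - a_{j+1}}{(a_{j-1} - a_j)(a_j - a_{j+1})} + \frac{-1}{a_j - a_{j+1}} \right) = 0.$$ 

□ 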
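
The closed form in Lemma 4 can also be checked numerically. The sketch below builds $A$ and the claimed inverse for an arbitrary choice of distinct values and multiplies them; the helper names and the test values are hypothetical, and the sketch additionally assumes $a_k \ne 0$, which the formula for $r_{kk}$ requires.

```python
import numpy as np

def max_index_matrix(a):
    """A with entries a_{ij} = a_{max(i, j)}."""
    k = len(a)
    idx = np.maximum.outer(np.arange(k), np.arange(k))
    return np.asarray(a, dtype=float)[idx]

def lemma4_inverse(a):
    """Tridiagonal matrix (r_{ij}) given by the closed form in Lemma 4."""
    a = np.asarray(a, dtype=float)
    k = len(a)
    gap = a[:-1] - a[1:]                 # gap[j] = a[j] - a[j+1] (0-based)
    R = np.zeros((k, k))
    R[0, 0] = 1.0 / gap[0]
    for j in range(1, k - 1):
        R[j, j] = (a[j - 1] - a[j + 1]) / (gap[j - 1] * gap[j])
    R[k - 1, k - 1] = a[k - 2] / (gap[k - 2] * a[k - 1])
    for j in range(k - 1):
        R[j, j + 1] = R[j + 1, j] = -1.0 / gap[j]
    return R

a = [7.0, 4.0, 2.5, 1.0, 0.5]            # arbitrary distinct values, last one nonzero
A = max_index_matrix(a)
R = lemma4_inverse(a)
product = A @ R                          # numerically the identity matrix
```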

APPENDIX B: PROOF OF LEMMA 2 
To prove Lemma 2, we need the following two results. 

Lemma 5. Let $X \in \mathbb{R}^{n \times n}$ be a lower triangular matrix with elements 1 on and below 
the diagonal and 0 in other places: 

$$X_{ij} = \begin{cases} 1 & i \ge j \\ 0 & i < j. \end{cases}$$ 

The minimal singular value of $X$ is greater than or equal to 0.5. 

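Proof. Let $X = (x_{ij}) \in \mathbb{R}^{n \times n}$ be the matrix satisfying the condition of the lemma. 
Note that the singular values of this matrix $X$ are the non-negative square roots of the 
eigenvalues of $X^T X$. Hence it suffices to show that all the eigenvalues of $X^T X$ are greater 
than or equal to 0.25. 

The explicit expression of $C_n = X^T X = (a_{ij}) \in \mathbb{R}^{n \times n}$ is 

$a_{ij} = n + 1 - \max\{i, j\}.$ 

By Lemma 4 (with $a_\ell = n + 1 - \ell$), we have 

$$C_n^{-1} = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{pmatrix}.$$ 

Then for any vector $u \in \mathbb{R}^{n \times 1}$, 

$$\begin{aligned} u^T C_n^{-1} u &= u_1^2 + \sum_{i=2}^{n} 2u_i^2 - 2\sum_{i=1}^{n-1} u_i u_{i+1} \\ &\le 2\sum_{i=1}^{n} u_i^2 - 2\sum_{i=1}^{n-1} u_i u_{i+1} \\ &\le 2\sum_{i=1}^{n} u_i^2 + 2\sum_{i=1}^{n-1} |u_i u_{i+1}|. \end{aligned}$$ 

By the fact that $\sum_{i=1}^{n-1} |u_i u_{i+1}| \le \frac{1}{2} \sum_{i=1}^{n-1} (u_i^2 + u_{i+1}^2) \le \sum_{i=1}^{n} u_i^2$, we have 

$$u^T C_n^{-1} u \le 4 \sum_{i=1}^{n} u_i^2,$$ 

which implies that the eigenvalues of $C_n^{-1}$ are less than or equal to 4 and thus the eigenvalues 
of $C_n$ are all greater than or equal to 0.25. 

□ 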
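
As a numerical illustration of Lemma 5 (not a substitute for the proof), the sketch below computes the smallest singular value of the lower-triangular matrix of ones for a few sizes and inspects $C_n^{-1}$ for $n = 5$. The sizes chosen are arbitrary and the helper name is hypothetical.

```python
import numpy as np

def lower_triangular_ones(n):
    """n x n matrix with 1 on and below the diagonal, 0 elsewhere."""
    return np.tril(np.ones((n, n)))

sizes = [2, 5, 20, 100]
min_singular = {}
for n in sizes:
    X = lower_triangular_ones(n)
    # np.linalg.svd returns singular values in descending order
    min_singular[n] = np.linalg.svd(X, compute_uv=False)[-1]

# C_n^{-1} should be tridiagonal: diagonal (1, 2, ..., 2), off-diagonals -1
X5 = lower_triangular_ones(5)
C5_inv = np.linalg.inv(X5.T @ X5)
expected = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
expected[0, 0] = 1.0
```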



The following lemma states the relationship between eigenvalues of second moments for 
centered and non-centered data. Let $X \in \mathbb{R}^{n \times p}$ be a data matrix. Define the (empirical) 
covariance matrix of $X$ to be 

$$S = \frac{X_c' X_c}{n},$$ 

where $X_c$ is the centered version of $X$, with the $j$-th column of $X_c$ being $X_j - \bar{X}_j$. Let the 
matrix of second moments of the data set $X$ be 

$$T = \frac{X' X}{n}.$$ 

Then the eigenvalues of $S$ and $T$ have the following property. 

Lemma 6 (Cadima and Jolliffe [2009]). Let $S$ be the covariance matrix for a given data 
set, and $T$ its corresponding matrix of non-central second moments. Let $\lambda_j(\cdot)$ be the $j$-th 
largest eigenvalue of a matrix. Then 

$\lambda_{j+1}(T) \le \lambda_j(S) \le \lambda_j(T).$ 

Lemma 6 can be found on page 5 in Cadima and Jolliffe [2009]. 
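
The interlacing in Lemma 6 can be illustrated on random data. The sketch below is only a numerical check under assumed inputs: the data matrix is randomly generated with nonzero column means, and the divisor $n$ is used for both $S$ and $T$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = rng.standard_normal((n, p)) + 2.0      # illustrative data with nonzero column means
Xc = X - X.mean(axis=0)                    # column-centered version

S = Xc.T @ Xc / n                          # covariance matrix
T = X.T @ X / n                            # non-central second moments

# eigvalsh returns ascending order; reverse so index j-1 holds the j-th largest
lam_S = np.linalg.eigvalsh(S)[::-1]
lam_T = np.linalg.eigvalsh(T)[::-1]

tol = 1e-10
upper_ok = bool(np.all(lam_S <= lam_T + tol))            # lambda_j(S) <= lambda_j(T)
lower_ok = bool(np.all(lam_T[1:] <= lam_S[:-1] + tol))   # lambda_{j+1}(T) <= lambda_j(S)
```

The check works because $T$ equals $S$ plus the rank-one matrix formed from the vector of column means, so the two spectra interlace.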

With the results from Lemma 5 and Lemma 6, we now prove Lemma 2. 

Proof. Let $X \in \mathbb{R}^{n \times n}$ be a lower triangular matrix with elements 1 on and below the 
diagonal and 0 in other places: 

$$X_{ij} = \begin{cases} 1 & i \ge j \\ 0 & i < j. \end{cases}$$ 

Let $\sigma_j(\cdot)$ denote the $j$-th largest singular value of a matrix. By Lemma 5, the smallest 
singular value is not less than 0.5, that is, $\sigma_j(X) \ge 0.5$, $\forall j = 1, \ldots, n$. Now let $X_c$ be the 
centered version of $X$. Then $X_c = [0, \tilde{X}]$, where $0$ is a column vector with all elements 0, 
and the $n \times (n-1)$ block, written $\tilde{X}$ here to distinguish it from the triangular matrix, is the 
matrix defined in Equation (10). By Lemma 6, we have 

$\sigma_{j+1}(X) \le \sigma_j(X_c) \le \sigma_j(X), \quad \forall j = 1, 2, \ldots, n - 1.$ 

In particular, taking $j = n - 1$ in the above inequalities gives $\sigma_{n-1}(X_c) \ge \sigma_n(X) \ge 0.5$. 
Since $X_c$ is singular, its minimal singular value is $\sigma_n(X_c) = 0$. Therefore, 

$\sigma_{n-1}(\tilde{X}) = \sigma_{n-1}(X_c) \ge 0.5.$ 

□ 
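
The steps of this proof can be traced numerically. In the sketch below, `X_tilde` denotes the centered triangular matrix with its zero first column dropped, which is assumed here to coincide with the matrix of Equation (10); the size `n = 50` is arbitrary.

```python
import numpy as np

n = 50
X = np.tril(np.ones((n, n)))
Xc = X - X.mean(axis=0)            # centering makes the first column identically zero
X_tilde = Xc[:, 1:]                # remaining n x (n-1) block

first_col_zero = bool(np.allclose(Xc[:, 0], 0.0))
sv_Xc = np.linalg.svd(Xc, compute_uv=False)          # descending order
sv_tilde = np.linalg.svd(X_tilde, compute_uv=False)  # descending order

smallest_Xc = sv_Xc[-1]            # numerically zero, since Xc is singular
smallest_tilde = sv_tilde[-1]      # bounded below by 0.5 according to Lemma 2
```

Dropping a zero column leaves the nonzero singular values unchanged, so the first $n - 1$ singular values of `Xc` agree with those of `X_tilde`.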
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